Transfer Learning Video Classification of Preserved, Mid-Range, and Reduced Left Ventricular Ejection Fraction in Echocardiography

Identifying patients with left ventricular ejection fraction (EF), either reduced [EF < 40% (rEF)], mid-range [EF 40–50% (mEF)], or preserved [EF > 50% (pEF)], is considered of primary clinical importance. An end-to-end video classification using AutoML in Google Vertex AI was applied to echocardiographic recordings. Datasets balanced by majority undersampling, each corresponding to one out of three possible classifications, were obtained from the Standford EchoNet-Dynamic repository. A train–test split of 75/25 was applied. A binary video classification of rEF vs. not rEF demonstrated good performance (test dataset: ROC AUC score 0.939, accuracy 0.863, sensitivity 0.894, specificity 0.831, positive predicting value 0.842). A second binary classification of not pEF vs. pEF was slightly less performing (test dataset: ROC AUC score 0.917, accuracy 0.829, sensitivity 0.761, specificity 0.891, positive predicting value 0.888). A ternary classification was also explored, and lower performance was observed, mainly for the mEF class. A non-AutoML PyTorch implementation in open access confirmed the feasibility of our approach. With this proof of concept, end-to-end video classification based on transfer learning to categorize EF merits consideration for further evaluation in prospective clinical studies.


Introduction
Worldwide, heart failure is one of the most concerning cardiac conditions, characterized by impaired quality of life, serious complications, and a high mortality rate [1].However, progress is constantly observed in therapy, especially since the development of dedicated units ensuring patient follow-up, and numerous observations show that the prognosis can be improved by adapting the therapeutic measures to the impairment of left ventricular function [2][3][4].The left ventricular ejection fraction (EF) is defined as the percentage of blood present at end-diastole in the left ventricle that is ejected during systole.It is the most evaluated parameter for classifying patients.The distinction was historically first made between patients with preserved EF (pEF,) where EF is not less than 50%, and not preserved EF (npEF).More recently, a distinction has been made in npEF patients between those with reduced EF (rEF), where the EF is less than 40%, and those with mid-range EF (mEF), where EF is between 40% and 50% [5,6].Probably more appropriately, mid-range EF is sometimes referred to as mildly reduced EF.
Transthoracic echocardiography, a non-invasive, easily portable, and irradiation-free method, is the most widely used means of assessing EF.However, the method has some limitations compared to other heart-imaging techniques, like contrast ventriculography, computed tomography, magnetic resonance imaging, or positron emission tomography.Obtaining a good estimate of EF requires carrying out a delineation of the left ventricular cavity in end-diastole and end-systole on several consecutive beats, followed by volume estimation using the Simpson method of slice.This is a lengthy process that requires considerable human expertise.
Therefore, many machine-learning (ML) algorithms for EF measurement have been developed [7].These models have mainly been tested on two publicly available labeled datasets, EchoNet-Dynamic [8] and CAMUS [9].Most of these approaches require an initial human and/or ML intervention to determine the contours of the left ventricular cavity.The expertise of the practitioner or the AI performance is essential in this process.
The receiver operating characteristic curve area under the curve (ROC AUC) score in a binary npEF vs. pEF classification is reported in two studies.Using the EchoNet-Dynamic three-component algorithm, the ROC AUC score was 0.97 for the institutional Stanford dataset and 0.96 for an external dataset obtained at Cedars-Sinai Medical Center [10].For the same classification task, ROC AUC scores of around 0.97 were obtained from datasets derived from the CAMUS repository using DPS-Net, a segmentation algorithm, in conjunction with the Simpson biplane method [11].Ten-fold cross-validation was performed in the latter study.More recently, transfer learning was proposed to increase the performance of the part of algorithms allowing performing the segmentation [12,13].
A different approach not yet described to the best of our knowledge consists of submitting the echocardiographic video sequences showing the beating heart without any phase of segmentation to a transfer learning algorithm of video classification.These algorithms are presently accessible either in free open-access packages or in the torchvision.modelssubpackage of PyTorch to run on personal or institutional computer systems or through pay access directly to the cloud via automated machine learning (AutoML) or customized code in Google Vertex AI.
Our objective was to evaluate whether an end-to-end transfer learning video classification, without human or AI determination of the left ventricular location or internal contours, is feasible and performs well in distinguishing patients with pEF, mEF, and rEF.This approach presents a certain analogy with a classification of celestial objects based on the amplitude of their pulsation, hence its name, "PulseHeart".The source of echocardiograms was the EchoNet-Dynamic dataset, which consisted of video sequences labeled by experts using the Simpson method.We created three balanced echocardiographic video datasets of different sizes (small: S-set, medium: M-set, large: L-set).Each set was subjected to three AutoML models with either binary (respectively, rEF vs. nrEF and npEF vs. pEF) or ternary (pEF, mEF, rEF) classification.Finally, we submitted the test video labels of the S-set to human re-evaluation.

Materials and Methods
The high-level design is shown in Figure 1.

Dataset and Data Curation
The datasets were obtained from the EchoNet-Dynamic database, comprising 10,030 echocardiographic video sequences preprocessed, deidentified, and converted from DI-COM to AVI format.Examples of end-diastolic and end-systolic four-chamber views for cases labeled pEF, mEF, and rEF are shown in Figure 2.
In the first curation step, we eliminated the 40 videos that were submitted by the Stanford group for re-evaluation by expert clinicians because of the highest absolute difference between the initial human label and the EchoNet-Dynamic's prediction.In the second step, we restricted the set to the 112 × 112-pixel videos at 50 frames per sec with 100 to 250-frame lengths to obtain an intermediate dataset.

Dataset and Data Curation
The datasets were obtained from the EchoNet-Dynamic database, comprising 10,030 echocardiographic video sequences preprocessed, deidentified, and converted from DI-COM to AVI format.Examples of end-diastolic and end-systolic four-chamber views for cases labeled pEF, mEF, and rEF are shown in Figure 2.

Dataset and Data Curation
The datasets were obtained from the EchoNet-Dynamic database, comprising 10,030 echocardiographic video sequences preprocessed, deidentified, and converted from DI-COM to AVI format.Examples of end-diastolic and end-systolic four-chamber views for cases labeled pEF, mEF, and rEF are shown in Figure 2.  In the third step, the S-set, M-set, and L-set were obtained by majority undersampling as described in Table 1.The size of each dataset is determined by the maximal size of the considered minority (rEF, mEF, and npEF, respectively).
as described in Table 1.The size of each dataset is determined by the maximal size of the considered minority (rEF, mEF, and npEF, respectively).
Unlike the EchoNet-Dynamic algorithm, the video classification by AutoML does not include a validation set; therefore, we grouped the video originally tagged VAL and TEST in the EchoNet-Dynamic database into the test category to obtain a 75/25 train-test split.The study's 3064 videos were uploaded in a Google Cloud Storage bucket folder.

Training
The training was performed using the MoViNet Video Clip Classification (tfvision/movinet-vcn, release date: 20 August 2023), a task-specific solution implemented in Google Vertex AI and pre-trained on the Kinetic 600 dataset.Using Vertex AI training pipelines, we created managed datasets and corresponding AutoML-trained models for video classification (Table 2).Unlike the EchoNet-Dynamic algorithm, the video classification by AutoML does not include a validation set; therefore, we grouped the video originally tagged VAL and TEST in the EchoNet-Dynamic database into the test category to obtain a 75/25 train-test split.The study's 3064 videos were uploaded in a Google Cloud Storage bucket folder.

Training
The training was performed using the MoViNet Video Clip Classification (tfvision/ movinet-vcn, release date: 20 August 2023), a task-specific solution implemented in Google Vertex AI and pre-trained on the Kinetic 600 dataset.Using Vertex AI training pipelines, we created managed datasets and corresponding AutoML-trained models for video classification (Table 2).

Analysis
For performance, we used the following metrics: accuracy, balanced accuracy, PRC AUC score, ROC AUC scores, and, for each class, precision, recall, and F1 score.The confusion matrix graphs were obtained by importing the results in a Python 3 Jupyter notebook with the help of Matplotlib's pyplot module and Seaborn's heatmap module.The Vertex AI output includes graphs of the precision-recall curve (PRC) and precision-recall by threshold curves (PRTC).These will be presented here.By reviewing the videos and noting each prediction, we were able to obtain graphs of ROC curves and calculate the corresponding AUC scores.For this purpose, we used the roc_curve and roc_auc_score modules of scikit-learn in Python.

Clinician Re-Evaluation
This re-evaluation was carried out for the three S-set experiments by a panel of five cardiologists who specialize in cardiovascular imaging (E.D., G.C., L.A., M.M., P.D.).A video subset was formed by grouping all false positives and false negatives observed in the two binary classifications and by adding the rEF cases predicted as pEF and the pEF cases predicted as rEF by the ternary classification.These videos were presented in random order to the panel members, who blindly classified them as rEF, mEF, and pEF.The majority rule was used for ranking, with mEF chosen in the event of a tie.Determination of poor technical quality and the presence of arrhythmia were also based on the majority rule.To explore the role of technical quality and arrhythmia at recording time, we calculated for each case the number of rater reports of degraded quality (Q-score, from 0 to 5) and of suspicion of arrhythmia (A-score, from 0 to 5).We used a two-tailed Student's t-test to assess if these scores were different in underestimation vs. overestimation on EF by at least one of our classifiers.

Binary Classification
The metric values for these six models are shown in Table 3.In general, the performance seems better with the S-set.Since a "positive" medical test means belonging to the class indicating disease, sensitivity is the recall for rEF in the rEF vs. nrEF classification and the recall for npEF in the npEF vs. pEF classification.Likewise, specificity is the recall for nrEF in the rEF vs. nrEF classification and the recall for pEF in the npEF vs. pEF classification.

npEF vs. pEF
Results obtained with the three datasets are shown in terms of confusion matrices in Figure 6 and in terms of PRC and PRTC in Figure 7. Please note that the npEF class is the minority in these models.The L-set obtained by undersampling the pEF class is the appropriate, balanced option for this binary classification (L-NP-P model).

npEF vs. pEF
Results obtained with the three datasets are shown in terms of confusion matrices in Figure 6 and in terms of PRC and PRTC in Figure 7. Please note that the npEF class is the minority in these models.The L-set obtained by undersampling the pEF class is the appropriate, balanced option for this binary classification (L-NP-P model).

ROC Curves for Models with Binary Classification
Figure 8 shows the ROC curves for the six models with binary classification.Pearson's correlation was 0.909 between ROC AUC scores and PRC AUC scores.

ROC Curves for Models with Binary Classification
Figure 8 shows the ROC curves for the six models with binary classification.Pearson's correlation was 0.909 between ROC AUC scores and PRC AUC scores.

Ternary Classification
The metric values for these three models are shown in Table 4. Results obtained with the three datasets are shown in terms of confusion matrices in Figure 9 and in terms of

Ternary Classification
The metric values for these three models are shown in Table 4. Results obtained with the three datasets are shown in terms of confusion matrices in Figure 9 and in terms of PRC and PRTC in Figure 10.Please note that the mEF class is the minority in these models.The M-set obtained by undersampling the rEF and pEF classes is the appropriate option for this ternary classification (M-3-C model).PRC and PRTC in Figure 10.Please note that the mEF class is the minority in these models.
The M-set obtained by undersampling the rEF and pEF classes is the appropriate option for this ternary classification (M-3-C model).

Clinician Re-Evaluation
Table A1 in Appendix A displays the detailed results of the re-evaluation.The recording quality was found to be poor in 29.9% (32 of 107) of these cases of mismatch between label and prediction, and arrhythmia was judged present in 7.5% (8 of 107).For comparison, in the clinician re-evaluation performed on the EchoNet-Dynamic's prediction model [10], the experts rated 32.5% (13 of 40) of videos as having poor image quality and 13% (5 of 40) of videos as showing arrhythmia.
The cross-tables (Figure 11) indicate that the panel judged that the EF was preserved in more cases than established by the original label and that the EF was reduced in fewer cases.However, this is mostly concerned with cases labeled mEF.A video labeled pEF was classified as rEF by three members of the panel, mEF by the fourth, and pEF by the last.No videos labeled rEF were classified as pEF by the majority.Such a paucity of possible "extreme" labeling errors (rEF instead of pEF or vice versa) was also observed when re-evaluating the EchoNet-Dynamic prediction model [10].The discordance table published in that article shows that only 2 out of 40 human rEF labels should be replaced by pEF, and only three pEF labels should be replaced by rEF.

Clinician Re-Evaluation
Table A1 in Appendix A displays the detailed results of the re-evaluation.The recording quality was found to be poor in 29.9% (32 of 107) of these cases of mismatch between label and prediction, and arrhythmia was judged present in 7.5% (8 of 107).For comparison, in the clinician re-evaluation performed on the EchoNet-Dynamic's prediction model [10], the experts rated 32.5% (13 of 40) of videos as having poor image quality and 13% (5 of 40) of videos as showing arrhythmia.
The cross-tables (Figure 11) indicate that the panel judged that the EF was preserved in more cases than established by the original label and that the EF was reduced in fewer cases.However, this is mostly concerned with cases labeled mEF.

Clinician Re-Evaluation
Table A1 in Appendix A displays the detailed results of the re-evaluation.The recording quality was found to be poor in 29.9% (32 of 107) of these cases of mismatch between label and prediction, and arrhythmia was judged present in 7.5% (8 of 107).For comparison, in the clinician re-evaluation performed on the EchoNet-Dynamic's prediction model [10], the experts rated 32.5% (13 of 40) of videos as having poor image quality and 13% (5 of 40) of videos as showing arrhythmia.
The cross-tables (Figure 11) indicate that the panel judged that the EF was preserved in more cases than established by the original label and that the EF was reduced in fewer cases.However, this is mostly concerned with cases labeled mEF.A video labeled pEF was classified as rEF by three members of the panel, mEF by the fourth, and pEF by the last.No videos labeled rEF were classified as pEF by the majority.Such a paucity of possible "extreme" labeling errors (rEF instead of pEF or vice versa) was also observed when re-evaluating the EchoNet-Dynamic prediction model [10].The discordance table published in that article shows that only 2 out of 40 human rEF labels should be replaced by pEF, and only three pEF labels should be replaced by rEF.

GitHub PyTorch Implementation
We conducted an additional series of experiments using open-access software, in which we were able to optimize model parameter selection (Appendix B).

Discussion
The main advantage of PulseHeart is that the training required for the examiner is limited to the ability to record a short video of the transthoracic four-chamber view with sufficient quality.As proof of concept, the results obtained here demonstrate that this endto-end approach is feasible and exhibits good performance.Such an EF classification algorithm, either AutoML or not, can be considered to aid in categorizing heart failure from the perspective of therapeutic guidance and prognostic.
Our aim was not to directly compare the metrics to those of the original Stanford study but to verify whether our proposed approach could exhibit acceptable performance.We chose to undersample the majority class to keep the computational cost within reasonable limits [14], which was necessary to carry out the present study.Before balancing, we selected an intermediate unbalanced dataset from the Stanford source that would be suitable for our proof of concept.We first discarded the tiny fraction (0.4%) of the source dataset for which the label was problematic in the a posteriori re-evaluation by five experts from Stanford [10].We acknowledge that some of these 40 labels were probably correct, but others were judged to be patent label errors.Our intention was, therefore, not to remove difficult cases but rather to obtain a dataset with a high rate of correct labels.Many difficult cases remained in our intermediate dataset, and this was confirmed by our fivecardiologist re-evaluation.We also selected the videos with the same predominant size

GitHub PyTorch Implementation
We conducted an additional series of experiments using open-access software, in which we were able to optimize model parameter selection (Appendix B).

Discussion
The main advantage of PulseHeart is that the training required for the examiner is limited to the ability to record a short video of the transthoracic four-chamber view with sufficient quality.As proof of concept, the results obtained here demonstrate that this end-to-end approach is feasible and exhibits good performance.Such an EF classification algorithm, either AutoML or not, can be considered to aid in categorizing heart failure from the perspective of therapeutic guidance and prognostic.
Our aim was not to directly compare the metrics to those of the original Stanford study but to verify whether our proposed approach could exhibit acceptable performance.We chose to undersample the majority class to keep the computational cost within reasonable limits [14], which was necessary to carry out the present study.Before balancing, we selected an intermediate unbalanced dataset from the Stanford source that would be suitable for our proof of concept.We first discarded the tiny fraction (0.4%) of the source dataset for which the label was problematic in the a posteriori re-evaluation by five experts from Stanford [10].We acknowledge that some of these 40 labels were probably correct, but others were judged to be patent label errors.Our intention was, therefore, not to remove difficult cases but rather to obtain a dataset with a high rate of correct labels.Many difficult cases remained in our intermediate dataset, and this was confirmed by our five-cardiologist re-evaluation.We also selected the videos with the same predominant size and frame rate, which corresponds to most real-world situations.Videos that were too short or too long were discarded because we wanted to evaluate an approach that included standardized recording without possibly biased human decisions to record more or fewer beats than usual.
For any attempt to compare PulseHeart to EchoNet-Dynamic, it would be necessary to use the full 10,030 video set for PulseHeart or to train the EchoNet-Dynamic model with our balanced datasets.This would also require k-fold cross-validation [15], which was not performed for EchoNet-Dynamic and is not available in Vertex AI for what concerns our study.Engaging in this comparison does not appear justifiable because EchoNet-Dynamic does not constitute a corresponding benchmark.It differs primarily in that it is not an end-to-end approach and, therefore, requires the presence of health personnel with significant expertise for the specific task of left ventricular cavity delineation when used for prediction in new patients.PulseHeart can, therefore, be an interesting solution in a certain number of clinical situations.Additionally, EchoNet-Dynamic only considered one type of classification, did not use balanced datasets, and was not tested to distinguish between the three classes of EF that are presently considered of clinical importance.
Some other limitations should be mentioned before interpreting the results in more detail.Limitations exist if one wishes to statistically compare the performance with the S, M, and L sets for the three classifications.Not only would this again involve k-fold cross-evaluation, but the study design was not suitable for this unplanned comparison.Consistent with our primary objective, the PulseHeart was tested for three classification types that could be considered in different clinical settings.The original design was to use the S-set for the rEF vs. nrEF binary classification, the L-set for the npEF vs. pEF binary classification, and the M-set for the ternary classification.We continued our investigation by training the models with the two other datasets for each type of classification, which led to some interesting observations.The same test dataset for the nine models would have been theoretically preferable for analyzing performance in function of the balancing method.The small and medium test datasets contain only videos from the large test dataset.After reduction via the random balancing process, these datasets, on average, continue to reflect their common origin.Therefore, we did not rerun the experiments with the large test dataset.This would have required additional time-consuming and expensive AutoML calculations without significantly improving our metric estimates or changing the validity of the scientific conclusion.
Another limitation is that in the AutoML approach, as well as in commercially available patented applications [16], there is no access to model details or parameter tuning.At the time of our study, Google Cloud Platform Vertex AI included, among the video autoML options, a classification module that was not available in Microsoft Azure or Amazon Web Services SageMaker.AutoML refers to any automated process of building ML models, consisting of searching for a suitable model, tuning hyperparameters, preparing data, and deploying the model.AutoML services are constantly developing, and applications are described in the non-medical and medical domains [17][18][19][20].The main disadvantage of AutoML is the black-box effect.Only very general considerations and examples are provided for the Video Classification on Google Cloud.We know that Vertex AI uses MoViNets and that the pre-training was on the Kineticcs 600 video dataset, but we lack a precise description of how AutoML worked in our experiments, nor can we provide an analysis of the resulting models, including hyperparameter values and training loss curves.However, this opacity was not a real hindrance in the present study aimed at proving the concept and assessing the potential of the PulseHeart approach.AutoML accomplished the task quickly, efficiently, and at a reasonable cost.A further inconvenience of the black-box effect is that there is no way to understand why the completion times of the algorithms are not proportional to the training dataset size, with the M-set being the longest to train.For this, we need to know if, in all AutoML experiments, the model was the same, the hyperparameters were similar, and the training loss curves were equally satisfactory.In our independent PyTorch trial, where these conditions were met, the completion time was proportional to the size of the dataset and the size of the model.This strengthens the hypothesis of different-sized models to explain the observation in AutoML experiments.
Despite these limitations, this study demonstrates that the PulseHeart concept is feasible, with a performance level generally judged acceptable in the field of medical imaging, where imperfect labeling and a relatively low number of observations are frequently encountered.In general, the best-performing dataset was the smallest (1658 videos), balanced for the 829 rEF cases representing the minority.The largest dataset (3064 videos) is balanced for npEF and is the second best in terms of performance.
The global metrics used in this study deserve some comments.Accuracy is a wellrecognized metric, but the F1 score is more representative in a dataset that is not fully balanced.The observed values of these two metrics do not differ much in the models trained here.Likewise, the PRC AUC score could be preferred to the ROC AUC score, which is the usual reference value.As expected, the two are deeply connected in our observations, but PR curves are deemed more informative when dealing with skewed datasets [21].For the ROC curve graphs of the npEF vs. pEF classification, we considered npEF positive because, in medicine, a test result indicative of illness is reported as positive by convention.
If we look at the class metrics, we can draw more detailed conclusions.For the binary classification to identify rEF (S-R-NR model), the positive predictive value (precision 1) is 0.842, the sensitivity is 0.894, the F1 score is 0.867, and the specificity is 0.831.The high sensitivity is a satisfactory result.Indeed, rEF cases should be missed as little as possible in a general patient population, as they can be the warning sign of serious cardiac problems.The specificity is such that few nrEF patients would undergo further cardiac investigations.For the binary classification to identify npEF (L-NP-P model), the positive predictive value (precision 1) is 0.888, the sensitivity is 0.761, the F1-score is 0.817, and the specificity is 0.891.
We observed, in general, lower metric values for the ternary classification.This was essentially marked for the class metrics concerning mEF.In contrast, in the M-3-C model where dataset balance and classification match, only 1 out of 169 pEF videos was classified as rEF, and only 7 out of 169 rEF were classified as pEF.
A binary approach is often preferred in ML applications for medical diagnosis concerning disease stages.For algorithms without transfer learning in Alzheimer's disease, a ternary classification was described using constructed cascaded convolutional neural networks [22], and several attempts of binary classification for disease vs. the normal status were reported, with, for instance, a model using a 3D VGG variant convolutional network [23].For grading glioma, a modular deep-learning pipeline using an ensemble of convolutional neural networks was proposed, realizing two binary classifications in a cascade [24].However, these authors state that the worst tumor stage (grade IV) is clearly distinct from grades III and II, which are less easy to separate.A variety of other deeplearning algorithms were tested for image multi-classification of brain tumors, as reviewed in an article where the authors proposed their own models [25].In the present study, the attempts of ternary classification might be hampered by the skewed smooth EF distribution where multimodality does not appear clearly (Figure 3).Our clinical re-evaluation also points to the difficulties of isolating a gray zone between frankly pathological states and normality, even in the presence of reliable labels.
We mentioned that the side observations obtained in the six models where the way of balancing does not match the type of classification must be interpreted with caution.Nevertheless, we cannot neglect that, as shown in Tables 3 and 4, improved metrics are sometimes observed in these mismatches.Figure 1c shows that the majority of npEF cases are in the S-NP-P and M-NP-P models.They have a higher sensitivity for identifying npEF than the L-NP-P model balanced by undersampling pEF.However, their lower specificity implies that many patients with pEF would undergo more extensive cardiac investigations.Only statistically robust comparisons, ideally based on k-fold cross-validation, are suitable to determine a dataset balancing strategy adapted to a given clinical situation.These should be included in future attempts to improve the models through a non-AutoML approach.
The power of transfer learning in medical imaging based on a pre-training of colored images is illustrated by the performances reported in a wide spectrum of medical conditions using grayscale images produced by techniques such as computed tomography, mammography, magnetic resonance imaging, or chest X-rays [26][27][28].Similarly, in the present study, the model was pre-trained on color video sets and applied with success to grayscaled clips.
We can wonder why this algorithm works without providing any information about the left ventricular cavity or the heart's cyclic evolution.The first point is to note that the four-chamber view is, by convention, pre-framed by the performer of the examination and oriented such that the left ventricle occupies approximately the same position on each video.The almost echo-free zone generated by the presence of ultrasound-permeable blood in the cavity can thus be tracked by the algorithm.This is only a conjecture because here, we lack a visual explanation by means of the saliency methods that, in static medical images, allow localizing the region of interest of the automated search process on a heatmap [29][30][31][32].It is also possible that other cyclic phenomena that affect the cardiac structures contribute to the confidence score, such as the left ventricular wall contraction, the complex relationship between the systolic and diastolic function, the left atrial filling and emptying, or the degree of interaction between the right and left ventricle.Another point is that the principle of EF calculation from a four-chamber echocardiographic view is based on a three-dimensional model extrapolated from a single plane.Ultrasound images are constructed from the reception of echoes coming from a slice perpendicular to the surface of the transducer and determined by its orientation.These echoes, particularly the backscattered ones, carry clues about the part of the heart adjacent to the reconstructed slice.This information can be correlated with the patient's EF.Furthermore, the power of transfer learning to classify cyclic events on video is well demonstrated in action recognition tasks.Therefore, the PulseHeart approach, which uses a "scene recognition task", can take advantage of the repetition of the cardiac cycle in the video sequences.
For future research, models based on the PulseHeart approach can be developed using SDKs such as TensorFlow or PyTorch.This can be done locally or on a cloud ML platform.Unreasonable costs encountered in commercial platforms could be avoided, especially if k-fold cross-validation is required and extensive experimentation is needed to refine the models.As we demonstrate in Appendix B, training loss curves can then be observed, hyperparameters tuned, and different kernels tested.Overfit can be detected more easily by introducing a validation phase into the training process.We can consider combining other data modalities (such as electrocardiogram, NT-proBNP, chest X-ray, or MRI) to enhance the performance of the classifier.Testing the predictions on external validation datasets from different centers may be considered to assess the generalization ability of this approach.One conceivable development is to articulate classical ML SDKs with quantum computing SDKs such as Pennylane and Qiskit to build a transfer learning classical-quantum hybrid video classifier.For example, the "extended" circuit version of MoViNet that we built on PyTorch allows the insertion of a doctored parameterized quantum circuit.The complexity of the videos and the possibility of interactions between the changing cardiac structures during the cardiac cycle are reasons to expect increased trainability and/or performance in such hybrid models [33][34][35].These were already tested for static grayscale images in the medical field [36][37][38][39].Transfer learning video classification of EF may be considered for other procedures, like transesophageal echocardiography, contrast ventriculography, magnetic resonance imaging, and computed tomography.Institutional Review Board Statement: Ethical review and approval were waived for this study due to the fact that all videos were downloaded from an anonymized shared dataset.
Informed Consent Statement: Patient consent was not considered nor possible because we used an anonymized dataset.

Data Availability Statement:
Files containing the managed dataset, the annotation sets, and the detailed re-evaluation ratings are available online: https://github.com/pulseheart/PulseHeart-AutoML(accessed on 12 October 2023).The Pytorch implementation with Python notebooks under Apache License 2.0 can be found at https://github.com/pulseheart/PulseHeart-PyTorch(accessed on 4 April 2024)).The videos can be downloaded from the Stanford AIMI Shared Dataset https: //echonet.github.io/dynamic/index.html(accessed on 12 October 2023) after login and agreeing to the Stanford University Dataset Research Use Agreement.A description of the video clip classification models can be found on GitHub at the official site of MoViNet https://github.com/tensorflow/models/blob/master/official/projects/movinet/README.md(accessed on 12 December 2013).

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix B.3. Metrics
The metric values are presented in Table A3.For each classifier, the highest accuracy, ROC AUC scores, and PR AUC scores are observed with the model MoViNet A0 modified.Figures A3 and A4 show the ROC and PR curves.

Appendix B.3 Metrics
The metric values are presented in Table A3.For each classifier, the highest accuracy, ROC AUC scores, and PR AUC scores are observed with the model MoViNet A0 modified.Figures A3 and A4 show the ROC and PR curves.
Figure 3 shows the distribution of the EF values of this intermediate dataset of 7380 videos.

Figure 1 .
Figure 1.Study flowchart: (a) data curation; (b) upload the study-managed dataset to Google Cloud Storage; (c) balanced data sets S, M, and L, the size of which depends on the minority class (rEF, mEF, and npEF, respectively); (d): video classifications to test; (e): create a per-model annotation dataset and upload the corresponding CSV files (one for the train and one for the test videos); (f) specify the model to be trained; (g) run the model for training; (h) analyze the test set; (i) human panel reassessment of false positives and false negatives.

Figure 2 .
Figure 2. Frames from videos pertaining to the study dataset.End-diastolic frames are shown on the upper part of the slide and end-systolic frames below: (a) 0X2753C50A8B05D7D5.avi,label pEF, EF 58.3% by Simpson method; (b) 0X2F3141F00A232601.avi,label mEF, EF 42.1% by Simpson method; (c) 0X41563E2CC2230C0E.avi,label rEF, EF 21.9% by Simpson method.

Figure 1 .
Figure 1.Study flowchart: (a) data curation; (b) upload the study-managed dataset to Google Cloud Storage; (c) balanced data sets S, M, and L, the size of which depends on the minority class (rEF, mEF, and npEF, respectively); (d): video classifications to test; (e): create a per-model annotation dataset and upload the corresponding CSV files (one for the train and one for the test videos); (f) specify the model to be trained; (g) run the model for training; (h) analyze the test set; (i) human panel reassessment of false positives and false negatives.

Figure 1 .
Figure 1.Study flowchart: (a) data curation; (b) upload the study-managed dataset to Google Cloud Storage; (c) balanced data sets S, M, and L, the size of which depends on the minority class (rEF, mEF, and npEF, respectively); (d): video classifications to test; (e): create a per-model annotation dataset and upload the corresponding CSV files (one for the train and one for the test videos); (f) specify the model to be trained; (g) run the model for training; (h) analyze the test set; (i) human panel reassessment of false positives and false negatives.

Figure 2 .
Figure 2. Frames from videos pertaining to the study dataset.End-diastolic frames are shown on the upper part of the slide and end-systolic frames below: (a) 0X2753C50A8B05D7D5.avi,label pEF, EF 58.3% by Simpson method; (b) 0X2F3141F00A232601.avi,label mEF, EF 42.1% by Simpson method; (c) 0X41563E2CC2230C0E.avi,label rEF, EF 21.9% by Simpson method.

Figure 2 .
Figure 2. Frames from videos pertaining to the study dataset.End-diastolic frames are shown on the upper part of the slide and end-systolic frames below: (a) 0X2753C50A8B05D7D5.avi,label pEF, EF 58.3% by Simpson method; (b) 0X2F3141F00A232601.avi,label mEF, EF 42.1% by Simpson method; (c) 0X41563E2CC2230C0E.avi,label rEF, EF 21.9% by Simpson method.

Figure 3 .
Figure 3. Distribution of EF in the 7380 videos of the intermediate dataset before balancing to create the S, M, and L sets.(a) Frequency histogram; (b) Cumulative histogram.

Figure 3 .
Figure 3. Distribution of EF in the 7380 videos of the intermediate dataset before balancing to create the S, M, and L sets.(a) Frequency histogram; (b) Cumulative histogram.

3. 1 . 1 .
rEF vs. nrEF Results obtained with the three datasets are shown in terms of confusion matrices in Figure 4 and in terms of PRC and PRTC in Figure 5. Please note that the rEF class is the minority in these models.The S-set obtained by undersampling the mEF and pEF classes is the appropriate, balanced option for this binary classification (S-R-NR model

3. 1 . 1 .
rEF vs. nrEF Results obtained with the three datasets are shown in terms of confusion matrices in Figure 4 and in terms of PRC and PRTC in Figure 5. Please note that the rEF class is the minority in these models.The S-set obtained by undersampling the mEF and pEF classes is the appropriate, balanced option for this binary classification (S-R-NR model).

Figure 4 .
Figure 4. Confusion matrices observed for predicting rEF vs. nrEF using three differently balanced datasets.

Figure 6 .
Figure 6.Confusion matrices observed for predicting npEF vs. pEF using three differently balanced datasets.

Figure 8 .
Figure 8. ROC curves observed in the six experiments of binary classification, along with the corre-
Figure8shows the ROC curves for the six models with binary classification.Pearson's correlation was 0.909 between ROC AUC scores and PRC AUC scores.

Figure 8 .
Figure 8. ROC curves observed in the six experiments of binary classification, along with the corresponding AUC: (a) prediction of rEF vs. nrEF; (b) prediction of npEF vs. pEF.

Figure 8 .
Figure 8. ROC curves observed in the six experiments of binary classification, along with the corresponding AUC: (a) prediction of rEF vs. nrEF; (b) prediction of npEF vs. pEF.

Figure 9 .
Figure 9. Confusion matrices observed in a ternary classification of the classes rEF, mEF, and pEF using three differently balanced datasets.

Figure 9 .
Figure 9. Confusion matrices observed in a ternary classification of the classes rEF, mEF, and pEF using three differently balanced datasets.

Figure 11 .
Figure 11.Cross-tables comparing labels obtained by the Simpson method with panel majority in the 107 reviewed videos: (a) ternary classification; (b) binary classification, rEF vs. nrEF; (c) binary classification, npEF vs. pEF.The Q-score was not significantly different in the subgroups of EF overestimation (n = 43) and underestimation (n = 63) by classifier (means ± SD: 1.88 ± 1.69 vs.1.65± 1.41, 104 df, t = 0.767, p = 0.444).The A-score was significantly higher in the overestimation subgroup (means ± SD: 0.70 ± 1.28 vs. 0.27 ± 0.79, 104 df, t = 2.128, p = 0.036).Figure12presents three misclassified cases reported of poor quality by at least three members of our re-evaluation panel.The original labels were, respectively, pEF, mEF, and rEF.The comparison with correctly classified cases of good quality, as those presented in Figure2,

Figure 12 .
Figure 12.Frames from misclassified videos.End-diastolic frames are shown on the upper part of the slide and end-systolic frames below: (a) 0X41ECEC7AAEEFD0E6.avi,label pEF, EF 57.3% by Simpson method, pEF by unanimous panel, Q-score 5, A-score 0, misclassified as rEF; (b) 0X7923B6B4614AF456.avi,label mEF, EF 49.1% by Simpson method, mEF for four raters and rEF for one rater, Q-score 3, A-score 0, misclassified as rEF; (c) 0X3503A92D7637451.avi,label rEF, EF 31.8% by Simpson method, rEF by unanimous panel, Q-score 3, A-score 0, misclassified as nrEF.

Figure 12 .
Figure 12.Frames from misclassified videos.End-diastolic frames are shown on the upper part of the slide and end-systolic frames below: (a) 0X41ECEC7AAEEFD0E6.avi,label pEF, EF 57.3% by Simpson method, pEF by unanimous panel, Q-score 5, A-score 0, misclassified as rEF; (b) 0X7923B6B4614AF456.avi,label mEF, EF 49.1% by Simpson method, mEF for four raters and rEF for one rater, Q-score 3, A-score 0, misclassified as rEF; (c) 0X3503A92D7637451.avi,label rEF, EF 31.8% by Simpson method, rEF by unanimous panel, Q-score 3, A-score 0, misclassified as nrEF.
The final weights for a prediction model were those corresponding to the first occurrence of the minimum loss observed during the validation phases.The loss curves concerning the rEF vs. nrEF classifier are shown in FigureA1.The execution times for models A0 modified, A0 extended, A1 modified, and A1 extended were 273 min, 265 min, 507 min, and 508 min, respectively.

Figure A1 . 22 Figure A1 .
Figure A1.Loss curves observed for the rEF vs. nrEF classifier.The MoViNet models A0 and A1 were each tested either with a modification involving the last Conv3d layer or by adding a linear layer to the backbone.The loss curves concerning the npEF vs. pEF classifier are shown in Figure A2.The execution times for models A0 modified and A0 extended were 580 min and 525 min, respectively.

Figure A2 .
Figure A2.Loss curves observed for the npEF vs. pEF classifier using the MoViNet model A0 after backbone modification or extension.

Figure A2 .
Figure A2.Loss curves observed for the npEF vs. pEF classifier using the MoViNet model A0 after backbone modification or extension.

Table 2 .
Model registry of video classification by AutoML, with job duration.
* Appears under the name "average precision" in Vertex AI output. ).
* Appears under the name "average precision" in Vertex AI output.

Table 4 .
Metrics observed in ternary video classification.
* Appears under the name "average precision" in Vertex AI output.Diagnostics 2024, 14, x FOR PEER REVIEW 9 of 22

Table 4 .
Metrics observed in ternary video classification.
* Appears under the name "average precision" in Vertex AI output.
EF: left ventricular ejection fraction obtained by the Simpson method.Pred: EF class predicted by classifier(s).Raters: Q indicates that the rater reported poor technical quality; A indicates that the rater reported arrhythmia.Panel: five-rater judgement.

Table A3 .
Metrics observed in binary video classification for the test dataset.